import dalex as dx
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
data = pd.read_csv('hotel_bookings.csv')
data.head()
| | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
5 rows × 32 columns
# to keep the plots simple, only a subset of the variables is used
data = data[['is_canceled', 'lead_time', 'arrival_date_year', 'adults', 'children', 'babies', 'booking_changes']]
data = data.dropna()
X, y = data.loc[:, data.columns != 'is_canceled'], data[['is_canceled']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)
RandomForestClassifier(max_depth=2, random_state=0)
# making prediction on unseen data
observation = X_test.iloc[[0]]  # first test row, kept as a one-row DataFrame
clf.predict(observation)
array([0], dtype=int64)
observation = pd.DataFrame({'lead_time': [203.0],
                            'arrival_date_year': [2016.0],
                            'adults': [2.0],
                            'children': [0.0],
                            'babies': [0.0],
                            'booking_changes': [4.0]},
                           index=['observation'])
exp = dx.Explainer(clf, X_train, y_train)
Preparation of a new explainer is initiated

  -> data              : 107447 rows 6 cols
  -> target variable   : Parameter 'y' was a pandas.DataFrame. Converted to a numpy.ndarray.
  -> target variable   : 107447 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Not specified, model's class short name will be used. (default)
  -> predict function  : <function yhat_proba_default at 0x0000019C693170D0> will be used (default)
  -> predict function  : Accepts pandas.DataFrame and numpy.ndarray.
  -> predicted values  : min = 0.139, mean = 0.37, max = 0.45
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.45, mean = -3.51e-05, max = 0.861
  -> model_info        : package sklearn

A new explainer has been created!
exp.predict(observation)
array([0.3079201])
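The explainer returns a number between 0 and 1 here, whereas `clf.predict` earlier returned a class label. The log above names `yhat_proba_default` as the predict function, so the value is presumably the predicted probability of the positive class (`is_canceled = 1`). The sketch below illustrates the relationship on synthetic data. `X_demo`, `y_demo` and `demo_clf` are hypothetical stand-ins and not part of this notebook, and it assumes the default predict function takes column 1 of `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): the hotel bookings file is not needed here.
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
demo_clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# Probability of the positive class: column 1 of predict_proba.
proba_positive = demo_clf.predict_proba(X_demo)[:, 1]

# predict() returns the more probable class, i.e. 1 when that probability exceeds 0.5.
labels = demo_clf.predict(X_demo)
labels_from_proba = (proba_positive > 0.5).astype(int)

print(proba_positive[:3])
print(np.array_equal(labels, labels_from_proba))
```

Read this way, a value of about 0.31 for the observation would correspond to the class label 0 (not canceled).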
bd_observation = exp.predict_parts(observation, type='break_down', label=observation.index[0])
bd_observation.result
| | variable_name | variable_value | variable | cumulative | contribution | sign | position | label |
|---|---|---|---|---|---|---|---|---|
| 0 | intercept | 1 | intercept | 0.370320 | 0.370320 | 1.0 | 7 | observation |
| 1 | lead_time | 203.0 | lead_time = 203.0 | 0.405468 | 0.035148 | 1.0 | 6 | observation |
| 2 | adults | 2.0 | adults = 2.0 | 0.412069 | 0.006601 | 1.0 | 5 | observation |
| 3 | babies | 0.0 | babies = 0.0 | 0.412456 | 0.000386 | 1.0 | 4 | observation |
| 4 | children | 0.0 | children = 0.0 | 0.412374 | -0.000082 | -1.0 | 3 | observation |
| 5 | arrival_date_year | 2016.0 | arrival_date_year = 2016.0 | 0.410609 | -0.001765 | -1.0 | 2 | observation |
| 6 | booking_changes | 4.0 | booking_changes = 4.0 | 0.307920 | -0.102689 | -1.0 | 1 | observation |
| 7 | | | prediction | 0.307920 | 0.307920 | 1.0 | 0 | observation |
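The break-down table is additive: the `cumulative` column is the running sum of `contribution`, starting from the intercept (the mean prediction) and ending at the prediction for this observation. A small check with the numbers copied from the table above; the DataFrame `bd` is built by hand for illustration and is not taken from `bd_observation.result`.

```python
import pandas as pd

# Contributions copied from the break-down table (intercept first).
bd = pd.DataFrame({
    "variable": ["intercept", "lead_time", "adults", "babies",
                 "children", "arrival_date_year", "booking_changes"],
    "contribution": [0.370320, 0.035148, 0.006601, 0.000386,
                     -0.000082, -0.001765, -0.102689],
})

# The cumulative column is the running sum of the contributions.
bd["cumulative"] = bd["contribution"].cumsum()

# The last cumulative value should match exp.predict(observation), about 0.3079
# (small differences come from rounding in the printed table).
final_value = bd["cumulative"].iloc[-1]
print(bd)
print(round(final_value, 4))
```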
bd_observation.plot()
# lead_time is the number of days between the date the booking
# was entered and the arrival (or cancelation) date
sh_observation = exp.predict_parts(observation, type='shap', B = 10, label=observation.index[0])
sh_observation.plot(bar_width = 16)
# the large number of booking changes lowers the predicted probability of cancelation
# the long lead time, however, makes cancelation more likely
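Break-down attributions depend on the order in which variables are fixed, which is why the SHAP variant is computed over several orderings. The sketch below illustrates the idea, assuming `type='shap'` averages break-down style attributions over `B` random variable orderings. The toy model `toy_predict`, the `background` data and the point `x` are hypothetical and unrelated to the hotel model; this is not dalex's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_predict(X):
    # Hypothetical model: the x0 * x1 interaction makes the ordering matter.
    return 0.2 * X[:, 0] + 0.5 * X[:, 0] * X[:, 1] - 0.3 * X[:, 2]

background = rng.normal(size=(500, 3))   # stand-in for the training data
x = np.array([1.0, 2.0, -1.0])           # observation to explain

def ordered_contributions(order):
    """Break-down style attributions for one variable ordering."""
    data = background.copy()
    previous = toy_predict(data).mean()  # intercept: mean prediction
    contrib = np.zeros(len(x))
    for j in order:
        data[:, j] = x[j]                # fix variable j at the observation's value
        current = toy_predict(data).mean()
        contrib[j] = current - previous  # change in the mean prediction
        previous = current
    return contrib

B = 10  # number of random orderings, as in the predict_parts call above
paths = np.array([ordered_contributions(rng.permutation(len(x))) for _ in range(B)])
shap_like = paths.mean(axis=0)           # average attribution per variable

intercept = toy_predict(background).mean()
prediction = toy_predict(x[None, :])[0]
print(shap_like)
print(intercept + shap_like.sum(), prediction)
```

In every ordering the attributions sum to the prediction minus the intercept, so the averaged values keep that property. Only variables involved in interactions (here the first two) receive different attributions under different orderings.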
observation1 = pd.DataFrame({'lead_time': [4.0],
                             'arrival_date_year': [2015.0],
                             'adults': [2.0],
                             'children': [0.0],
                             'babies': [1.0],
                             'booking_changes': [0.0]},
                            index=['observation1'])
sh_observation1 = exp.predict_parts(observation1, type='shap', B = 10, label=observation1.index[0])
sh_observation1.plot(bar_width = 16)
# a short lead time makes cancelation less likely
# the same applies to the presence of a baby
# the absence of booking changes, in contrast, contributes positively to the final result